Search CORE

22 research outputs found

Vapor SIMD: Auto-Vectorize Once, Run Everywhere

Author: Cohen Albert
Dyshel Sergei
Nuzman Dorit
Rohou Erven
Rosen Ira
Williams Kevin
Yuste David
Zaks Ayal
Publication venue: HAL CCSD
Publication date: 01/01/2011
Field of study

International audienceJust-in-Time (JIT) compiler technology offers portability while facilitating target- and context-specific specialization. Single-Instruction-Multiple-Data (SIMD) hardware is ubiquitous and markedly diverse, but can be difficult for JIT compilers to efficiently target due to resource and budget constraints. We present our design for a synergistic auto-vectorizing compilation scheme. The scheme is composed of an aggressive, generic offline stage coupled with a lightweight, target-specific online stage. Our method leverages the optimized intermediate results provided by the first stage across disparate SIMD architectures from different vendors, having distinct characteristics ranging from different vector sizes, memory alignment and access constraints, to special computational idioms.We demonstrate the effectiveness of our design using a set of kernels that exercise innermost loop, outer loop, as well as straight-line code vectorization, all automatically extracted by the common offline compilation stage. This results in performance comparable to that provided by specialized monolithic offline compilers. Our framework is implemented using open-source tools and standards, thereby promoting interoperability and extendibility

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

Hal-Diderot

HAL-Rennes 1

Vapor SIMD: Auto-Vectorize Once, Run Everywhere

Author: Cohen Albert
Dyshel Sergei
Nuzman Dorit
Rohou Erven
Rosen Ira
Williams Kevin
Yuste David
Zaks Ayal
Publication venue: HAL CCSD
Publication date: 04/04/2011
Field of study

INRIA a CCSD electronic archive server

ACOTES project: Advanced compiler technologies for embedded streaming

Author: Albert Cohen
Alex Ramírez
Andrea Ornstein
Antoniu Pop
Ayal Zaks
Cupertino Miranda
Cédric Bastoul
David Ródenas
Dorit Nuzman
E. Blossom
E.A. Lee
Eduard Ayguadé
Erven Rohou
Harm Munk
Ira Rosen
J. Hoogerbrugge
Konrad Trifunović
Louis-Noël Pouchet
M. Gschwind
M. Wolfe
Marc Duranton
Marco Cornero
Menno Lindwer
Mohammed Fellahi
Paul Carpenter
Philippe Dumont
R. Allen
R.G. Scarborough
Razya Ladelsky
Roger Ferrer
S. Campanoni
Sebastian Pop
Uzi Shvadron
Xavier Martorell
Zbigniew Chamski
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011
Field of study

Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.Peer ReviewedPostprint (published version

HAL-CentraleSupelec

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

UPCommons. Portal del coneixement obert de la UPC

INRIA a CCSD electronic archive server

HAL-MINES ParisTech

The University of Manchester - Institutional Repository

HAL-Rennes 1

Automatically Harnessing Sparse Acceleration

Sparse linear algebra is central to many scientific programs, yet compilers fail to optimize it well. High-performance libraries are available, but adoption costs are significant. Moreover, libraries tie programs into vendor-specific software and hardware ecosystems, creating non-portable code. In this paper, we develop a new approach based on our specification Language for implementers of Linear Algebra Computations (LiLAC). Rather than requiring the application developer to (re)write every program for a given library, the burden is shifted to a one-off description by the library implementer. The LiLAC-enabled compiler uses this to insert appropriate library routines without source code changes. LiLAC provides automatic data marshaling, maintaining state between calls and minimizing data transfers. Appropriate places for library insertion are detected in compiler intermediate representation, independent of source languages. We evaluated on large-scale scientific applications written in FORTRAN; standard C/C++ and FORTRAN benchmarks; and C++ graph analytics kernels. Across heterogeneous platforms, applications and data sets we show speedups of 1.1

\times

to over 10

\times

without user intervention.Comment: Accepted to CC 202

arXiv.org e-Print Archive

Crossref

Edinburgh Research Explorer

Auto-vectorization of interleaved data for SIMD

Author: Ayal Zaks
Dorit Nuzman
Ira Rosen
Publication venue
Publication date: 01/01/2006
Field of study

been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). Copies may be requested from IBM T. J. Watson Research Center, P

CiteSeerX

Towards a First Vertical Prototyping of an Extremely Fine-grained Parallel Programming Approach

Author: Chau-Wen Tseng
Dorit Naishlos
Joseph Nuzman
Uzi Vishkin
Publication venue
Publication date: 01/01/2001
Field of study

Explicit-multithreading (XMT) is a parallel programming approach for exploiting on-chip parallelism. As such, XMT introduces a computational framework with 1) a simple programming style that relies on finegrained PRAM-style algorithms; 2) hardware support for low-overhead parallel threads, scalable load balancing, and efficient synchronization. The missing link between the algorithmic-programming level and the architecture level is provided by the first prototype XMT compiler. This paper also takes this new opportunity to 1) evaluate the overall effectiveness of the interaction between the programming model and the hardware, and 2) enhance its performance where needed, incorporating new optimizations into the XMT compiler. We present a wide range of applications, which written in XMT obtain significant speedups relative to the best serial programs. Results show the XMT architecture generally succeeds in providing low-overhead parallel threads and uniform access times on-chip. ..

CiteSeerX

Crossref

Evaluating the XMT Parallel Programming Model

Author: Chau-wen Tseng
Dorit Naishlos
Joseph Nuzman
Uzi Vishkin
Publication venue
Publication date: 01/01/2001
Field of study

Explicit-multithreading (XMT) is a parallel programming model designed for exploiting on-chip parallelism. Its features include a simple thread execution model and an efficient prefix-sum instruction for synchronizing shared data accesses. By taking advantage of low-overhead parallel threads and high on-chip memory bandwidth, the XMT model tries to reduce the burden on programmers by obviating the need for explicit task assignment and thread coarsening. This paper presents features of the XMT programming model, and evaluates their utility through experiments on a prototype XMT compiler and architecture simulator. We find the lack of explicit task assignment has slight effects on performance for the XMT architecture. Despite low thread overhead, thread coarsening is still necessary to some extent, but can usually be automatically applied by the XMT compiler. The prefix-sum instruction provides more scalable synchronization than traditional locks, and the simple run-until-completion thread execution model (no busywaits) does not impair performance. Finally, the combination of features in XMT can encourage simpler parallel algorithms that may be more efficient than more traditional complex approaches. 1

CiteSeerX

Crossref